docs: EP-1270 Authorization (access control) design proposal #2075

Open
davidkarlsen wants to merge 2 commits into kagent-dev:main from davidkarlsen:feat/ep-authz

@davidkarlsen

Contributor

Summary

Adds an Enhancement Proposal for authorization (access control) in KAgent — issue #1270.

Today the controller ships with NoopAuthorizer, so once a user is authenticated they can list, invoke, edit and delete every Agent, ModelConfig and ToolServer across every namespace. Enabling OIDC (#1293) gives authentication but no access control. This EP proposes the fine-grained authorization that EP-476 explicitly deferred.

Approach

The earlier #1270 discussion stalled on a design tension: an opinionated in-process RBAC engine vs. a pluggable extension point. The EP proposes CEL as the resolution — it's both:

  • In-process default, no new SPOF (cel-go is already in our module graph), and
  • Not a hard-coded RBAC model — policy is an expression over claims/verb/resource, so groups are one option among many and the project isn't married to one engine.

The auth.Authorizer interface stays the seam, so an external/OPA authorizer (#1370) remains pluggable. Per-resource policy lives on the Agent CR, compiled via reconciliation (cached, validated onto status.conditions), enforced centrally. Builds on the stalled prototypes in #1766 (per-agent annotation + list filtering + A2A gating) and #1370 (external authorizer interface) rather than starting over.

Design comment that led here: #1270 (comment)

Status

provisional — following the "merge early and iterate" guidance in the EP template. High-level direction is the goal; details (per-resource carrier, policy-combining semantics, default-deny behavior) are flagged as Open Questions / UNRESOLVED for discussion.

Looking for a maintainer sponsor and a directional 👍 on "CEL as the default, behind the existing interface."

/cc @EItanya @peterj

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings June 23, 2026 10:57
@github-actions (bot) added the documentation and enhancement-proposal labels on Jun 23, 2026
@davidkarlsen

Contributor Author

@EItanya @peterj PTAL

Copilot AI left a comment

Contributor

Pull request overview

Adds a new Enhancement Proposal (EP-1270) documenting a design for introducing fine-grained authorization (access control) in KAgent, centered on CEL-based policy evaluation while preserving the existing auth.Authorizer seam for pluggable implementations.

Changes:

  • Introduces EP-1270 documenting current authorization gaps and the proposed CEL-based default authorizer.
  • Specifies a policy model, decision context, and rollout strategy (opt-in, fail-closed, cached compilation).
  • Outlines operational considerations (list filtering, A2A gating) and an initial test plan.

Comment thread design/EP-1270-Authorization.md Outdated
davidkarlsen and others added 2 commits June 23, 2026 15:28
Proposes a real Authorizer to replace the open-by-default NoopAuthorizer:
CEL-based, in-process, behind the existing auth.Authorizer interface, with
per-resource policy on the Agent CR compiled via reconciliation and a
default-deny model. Builds on the stalled prototypes in kagent-dev#1766 and kagent-dev#1370.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: David J. M. Karlsen <david@davidkarlsen.com>

Address PR review: ProxyAuthenticator only populates Principal.Claims for
direct user calls; the agent-call path (X-Agent-Name) sets User/Agent but not
Claims. Qualify the Background statement and strengthen Open Question #5: a
claims-only fail-closed policy would deny internal agent/M2M traffic, so the
model needs an agent-identity match or a separate M2M lane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: David J. M. Karlsen <david@davidkarlsen.com>
@davidkarlsen

Contributor Author

@dimetron PTAL?

@dimetron

Contributor

The proposal picks the right engine (CEL) and keeps the existing
auth.Authorizer interface as the extension seam.


What's good (approve)

  • CEL is the right engine. It runs in-process, it is sandboxed and guaranteed to
    terminate, it evaluates in microseconds, and cel-go is already in the module
    graph, so it adds no new single point of failure.
  • It keeps auth.Authorizer as the extension seam, so the external OPA option
    (#1370, "External auth init") stays a drop-in.
  • Compiling policy at reconcile time into a generation-keyed cache, with a lazy
    fallback on cache miss, is the right pattern for a controller.
  • The default path is fail-closed, opt-in, and backward compatible, and a bad
    policy shows up on status.conditions as AccessPolicyValid.
  • It builds on the stalled prototypes (#1766, "feat: group-based agent
    authorization via OIDC groups claim", for list filtering and the A2A gate;
    #1370, "External auth init", for the external authorizer) instead of
    starting over.
  • The non-goals and alternatives are honest. Casbin, OPA-only, a bespoke DSL,
    and SubjectAccessReview each get dismissed with a real reason.
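
The generation-keyed cache with a lazy fallback mentioned in the list above could look roughly like this. It is a minimal sketch under stated assumptions, not the EP's implementation: policyCache, compiledPolicy, and compileFunc are hypothetical names, and the compile step is a placeholder for the real cel-go compilation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// compiledPolicy stands in for a compiled CEL program (hypothetical type).
type compiledPolicy struct{ source string }

// compileFunc is a placeholder for the real CEL compile step.
type compileFunc func(source string) (*compiledPolicy, error)

type cacheKey struct{ namespace, name string }

type cacheEntry struct {
	generation int64
	policy     *compiledPolicy
}

// policyCache keeps one compiled policy per resource, tagged with the
// metadata.generation it was compiled from.
type policyCache struct {
	mu      sync.Mutex
	compile compileFunc
	entries map[cacheKey]cacheEntry
}

func newPolicyCache(c compileFunc) *policyCache {
	return &policyCache{compile: c, entries: map[cacheKey]cacheEntry{}}
}

// Get returns the cached policy when the generation matches. On a miss or a
// stale generation it compiles lazily and replaces the entry.
func (c *policyCache) Get(ns, name string, generation int64, source string) (*compiledPolicy, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := cacheKey{ns, name}
	if e, ok := c.entries[key]; ok && e.generation == generation {
		return e.policy, nil
	}
	p, err := c.compile(source)
	if err != nil {
		return nil, err // the caller is expected to treat this as deny
	}
	c.entries[key] = cacheEntry{generation: generation, policy: p}
	return p, nil
}

// Delete evicts the entry when the resource is removed.
func (c *policyCache) Delete(ns, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, cacheKey{ns, name})
}

func main() {
	compiles := 0
	cache := newPolicyCache(func(src string) (*compiledPolicy, error) {
		compiles++
		if src == "" {
			return nil, errors.New("empty policy")
		}
		return &compiledPolicy{source: src}, nil
	})
	cache.Get("kagent", "demo", 1, "true")  // miss: compiles
	cache.Get("kagent", "demo", 1, "true")  // hit: no compile
	cache.Get("kagent", "demo", 2, "false") // new generation: recompiles
	fmt.Println("compiles:", compiles)
}
```

Keying on generation means a spec change invalidates the entry without an explicit eviction call, while the reconciler can still pre-warm the cache by calling Get itself.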

What it missed (must be addressed before it is accepted)

  1. Coverage is overstated. The EP says authz is "wired into every handler
    (~25 call sites)", but only about 8 of the ~22 handler areas gate anything
    today. Sessions, memory, tasks, model-provider config, companion secrets, and
    A2A invoke are all ungated, and those are the most sensitive surfaces. The EP
    needs an honest coverage matrix (below) and a commitment to close them.
  2. M2M and agent calls carry no claims, so a fail-closed claims-based policy
    would deny all internal A2A traffic. That is a blocker, not an open question.
    The cleanest fix is workload identity (see below), which also removes the
    spoofable X-Agent-Name header the agent path trusts today.
  3. Policy combining is left as UNRESOLVED, but it is core semantics and nothing
    can be implemented without it. Pin it down: default-deny, allow if either the
    central policy or the matching per-resource policy permits, and only consult
    a per-resource policy for its own resource. That last rule is what
    structurally guarantees the widen-only invariant, so the EP should say so.
  4. There is no invoke verb. A2A invocation collapses onto get, so a policy
    cannot tell "may read" apart from "may run". Add VerbInvoke.

M2M: use workload identity, keep one policy model

Today the agent path identifies the caller from the unverified X-Agent-Name
header and sets no claims. Instead, let the M2M caller present a verified
identity token and bind it into the principal. Any of these works, in order of
how native they are to the stack:

  • A Kubernetes projected ServiceAccount token, with sub
    system:serviceaccount:<ns>:<sa>, audience-bound to kagent.
  • A SPIFFE JWT-SVID, with sub spiffe://<trust-domain>/ns/<ns>/sa/<sa>.
  • Istio's default mTLS (X.509-SVID), where the sidecar forwards the peer SPIFFE
    ID in the X-Forwarded-Client-Cert header.

The authenticator then populates Principal.Claims and Principal.Agent.ID from
that verified identity, so the same CEL model covers humans and machines with no
separate lane:

// SPIFFE JWT-SVID
claims.sub == "spiffe://kagent.local/ns/kagent/sa/agent-runner"

// or a projected Kubernetes ServiceAccount token
claims.sub == "system:serviceaccount:kagent:agent-runner"

This resolves the M2M blocker and the spoofable-header risk at once. Keep it
simple in the first cut: pick one carrier (projected SA token is the most
native), verify the audience, parse the identity into structured fields once,
and leave on-behalf-of user delegation (#2071 / STS) for later. Workload
identity tells you which agent called, not on whose behalf.
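
"Parse the identity into structured fields once" could look like the sketch below. It assumes the token's signature and audience have already been verified elsewhere and only handles the subject string. WorkloadIdentity and ParseSubject are hypothetical names, and the SPIFFE path layout is the ns/sa form quoted above, which is a common convention rather than something SPIFFE mandates.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// WorkloadIdentity is a hypothetical structured form of a verified subject.
type WorkloadIdentity struct {
	TrustDomain    string // empty for Kubernetes ServiceAccount subjects
	Namespace      string
	ServiceAccount string
}

const saPrefix = "system:serviceaccount:"

// ParseSubject accepts either "system:serviceaccount:<ns>:<sa>" or
// "spiffe://<trust-domain>/ns/<ns>/sa/<sa>" and rejects everything else.
func ParseSubject(sub string) (WorkloadIdentity, error) {
	if strings.HasPrefix(sub, saPrefix) {
		parts := strings.Split(strings.TrimPrefix(sub, saPrefix), ":")
		if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
			return WorkloadIdentity{}, errors.New("malformed service account subject")
		}
		return WorkloadIdentity{Namespace: parts[0], ServiceAccount: parts[1]}, nil
	}
	u, err := url.Parse(sub)
	if err != nil || u.Scheme != "spiffe" || u.Host == "" {
		return WorkloadIdentity{}, errors.New("unsupported subject format")
	}
	seg := strings.Split(strings.TrimPrefix(u.Path, "/"), "/")
	if len(seg) != 4 || seg[0] != "ns" || seg[2] != "sa" || seg[1] == "" || seg[3] == "" {
		return WorkloadIdentity{}, errors.New("malformed SPIFFE ID path")
	}
	return WorkloadIdentity{TrustDomain: u.Host, Namespace: seg[1], ServiceAccount: seg[3]}, nil
}

func main() {
	for _, sub := range []string{
		"system:serviceaccount:kagent:agent-runner",
		"spiffe://kagent.local/ns/kagent/sa/agent-runner",
		"alice@example.com",
	} {
		id, err := ParseSubject(sub)
		fmt.Printf("%q -> %+v, err=%v\n", sub, id, err)
	}
}
```

Exposing the namespace and service account as separate fields lets a policy compare them directly instead of string-matching the whole subject.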


Enforcement model: middleware vs per-handler (root cause of gap #1)

The coverage gap is not really "14 handlers forgot a check". It is the
enforcement model. Authz today is opt-in per handler, with Check() scattered
across 8 files, which makes it default-open: any route without a Check() call
is silently a bypass. A one-time audit fixes today's snapshot, but it rots the
moment someone adds a route.

Recommendation: add a deny-by-default authz middleware as the backstop, and keep
per-handler Check() for the cases the middleware cannot cover. A route that is
not explicitly mapped to a (resourceType, verb) is denied.

This fits the code as it stands:

  • The server already runs a middleware chain (s.router.Use(...)) with an
    AuthnMiddleware sibling, so an AuthzMiddleware slots in right after it
    (go/core/internal/httpserver/server.go:356-360).
  • Routes use gorilla/mux with {namespace} and {name} vars, so the middleware
    can read the resource name and namespace from mux.Vars(r) and the verb from
    the HTTP method (the same switch handlers.Check already uses).

What stays in the handler, because the middleware cannot do it:

| Concern | Where | Why |
| --- | --- | --- |
| Coarse gate, "may you touch this resource type and verb at all" | Middleware (deny-by-default) | One chokepoint; closes the gap structurally |
| List filtering, per returned item | Handler | Needs the response set, not just the request |
| Create where name and namespace are in the body | Handler | Middleware sees path vars, not the decoded body |
| Per-resource policy combining (Agent spec.accessPolicy) | Handler | Needs the fetched resource plus central-vs-resource combining |
| Non-uniform routes (A2A /{ns}/{name}, sessions and memory keyed by agent) | Both, with an explicit route entry | Resource identity is not inferable from the path shape alone |
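
The list-filtering row is the clearest case of work that has to stay in the handler. A minimal sketch follows, assuming simplified stand-ins for the types in go/core/pkg/auth: the Authorizer signature, Resource fields, and filterAgents helper here are hypothetical and may not match the real ones.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the real auth types.
type Principal struct{ Groups []string }

type Resource struct{ Type, Namespace, Name string }

type Authorizer interface {
	// Check returns nil to allow and a non-nil error to deny.
	Check(p Principal, verb string, r Resource) error
}

type Agent struct{ Namespace, Name string }

// filterAgents keeps only the agents the principal may "get". It runs after
// the handler has fetched the full list, which is why middleware that only
// sees the request cannot do this.
func filterAgents(az Authorizer, p Principal, agents []Agent) []Agent {
	visible := make([]Agent, 0, len(agents))
	for _, a := range agents {
		res := Resource{Type: "Agent", Namespace: a.Namespace, Name: a.Name}
		if err := az.Check(p, "get", res); err != nil {
			continue // denied items are dropped, not reported as errors
		}
		visible = append(visible, a)
	}
	return visible
}

// namespaceAuthorizer is a toy policy: allow only one namespace.
type namespaceAuthorizer struct{ allowed string }

func (n namespaceAuthorizer) Check(_ Principal, _ string, r Resource) error {
	if r.Namespace != n.allowed {
		return errors.New("denied")
	}
	return nil
}

func main() {
	agents := []Agent{{"team-a", "helper"}, {"team-b", "billing"}, {"team-a", "search"}}
	out := filterAgents(namespaceAuthorizer{allowed: "team-a"}, Principal{}, agents)
	fmt.Println(out)
}
```

Dropping denied items silently, rather than failing the whole list call, avoids revealing that hidden resources exist.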

This needs two pieces: a declarative registry that maps each route to its
resource type, verb, and whether it is public, and an explicit public allowlist
(/health, /version, and the self-scoped /api/user) so probes and
self-calls keep working.

Why it is worth the refactor: a missing registry entry fails closed (the request
is denied), whereas a missing Check() today fails open (the request goes
through). That asymmetry is the whole reason to do it. The same middleware also
covers the A2A PathPrefix handler (server.go:347) instead of leaving it to a
separate hand-wired gate. The EP should state this enforcement-model choice
explicitly rather than inheriting the implicit per-handler one.


Smaller suggestions

  • Keep authorizer set to noop whenever auth.mode=unsecure. Dev clusters
    have no claims, so a claims-based policy would lock them out.
  • Add Namespace to auth.Resource, or normalize it once inside the
    authorizer, so handlers stop re-parsing namespace/name.
  • Add an observability section: decision metrics
    (kagent_authz_decisions_total{result,resource_type},
    kagent_authz_config_valid) and deny logging at V(1) that never prints
    claim values.
  • Add a threat-model and trust-boundary paragraph. The proxy validates the JWT
    and the controller trusts the proxy. List what authz does not cover (direct
    pod access, tracked in #2028 "Secure the kagent UI", and the ungated
    endpoints).
  • Add break-glass and bootstrap-admin guidance so turning authz on against a
    live cluster does not lock everyone out.
  • Note that the Casbin sections in EP-476 are superseded by EP-1270.
  • Note the size limits: a CEL string in spec.accessPolicy counts against the
    etcd object limit, the central ConfigMap is capped at 1 MiB, and the
    compiled-program cache currently only evicts on delete.
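
The suggestion to normalize the namespace once could be sketched as below. The Namespace field is the proposed addition (today's Resource has only Name and Type, per the source locations listed under References), and NormalizeResource with its defaultNamespace parameter is a hypothetical helper, not existing kagent code.

```go
package main

import (
	"fmt"
	"strings"
)

// Resource with the proposed Namespace field.
type Resource struct {
	Type      string
	Namespace string // proposed; not present on the current struct
	Name      string
}

// NormalizeResource splits a "namespace/name" reference once so callers do
// not re-parse it. A bare name falls back to defaultNamespace (an assumption
// made for this sketch).
func NormalizeResource(typ, ref, defaultNamespace string) (Resource, error) {
	ns, name, found := strings.Cut(ref, "/")
	if !found {
		ns, name = defaultNamespace, ref
	}
	if ns == "" || name == "" || strings.Contains(name, "/") {
		return Resource{}, fmt.Errorf("invalid resource reference %q", ref)
	}
	return Resource{Type: typ, Namespace: ns, Name: name}, nil
}

func main() {
	for _, ref := range []string{"kagent/demo", "demo", "a/b/c", "/demo"} {
		res, err := NormalizeResource("Agent", ref, "default")
		fmt.Printf("%q -> %+v, err=%v\n", ref, res, err)
	}
}
```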

Authorization gates matrix

Verified against main by counting Check() and authorizeAgentRequest calls
per handler file in go/core/internal/httpserver/handlers.

| Handler | Routes (examples) | Gates today | Sensitivity | Risk if CEL enabled |
| --- | --- | --- | --- | --- |
| agents.go | /api/agents/* | yes, ~12 (incl. authorizeAgentRequest) | high | covered |
| modelconfig.go | /api/modelconfigs/* | yes, 5 | high (cred refs) | covered |
| prompttemplates.go | /api/prompttemplates/* | yes, 5 | medium | covered |
| toolservers.go | /api/toolservers/* | yes, 4 | medium | covered |
| toolservertypes.go | /api/toolservertypes/* | yes, 1 | low | covered |
| mcpapps.go | /api/mcpapps/* | yes, 1 | low | covered |
| substrate.go | /api/substrate/* | yes, 1 | medium | covered |
| sessions.go | /api/sessions/* | none | high (conversation content) | bypass |
| memory.go | /api/memories/* | none | high (embeddings, PII) | bypass |
| tasks.go | /api/tasks/* | none | high (task data) | bypass |
| checkpoints.go | LangGraph checkpoints | none | high (state) | bypass |
| modelproviderconfig.go | /api/modelproviderconfigs/* | none | high (credential-adjacent) | bypass |
| models.go | /api/models | none | medium | bypass |
| namespaces.go | /api/namespaces | none | medium (enumeration) | bypass |
| tools.go | /api/tools | none | low to medium | bypass |
| feedback.go | /api/feedback/* | none | low | bypass |
| crewai.go | CrewAI routes | none | medium | bypass |
| agentharness_gateway.go | /api/agentharnesses/... | none | medium | bypass |
| agentharness_session.go | harness sessions | none | medium | bypass |
| companion_secrets.go | companion secrets | none | high (secrets) | bypass |
| current_user.go | /api/user | none | low (self) | acceptable |
| health.go | /health, /version | none (by design) | none | acceptable |
| A2A invoke | /api/a2a/{ns}/{name} | authn only (A2AAuthenticator), no Authorizer | high (direct agent run) | bypass |

Summary: about 8 of the ~22 handler areas gate authz today, and all of them
are CRUD on configuration resources such as Agent, ModelConfig, ToolServer, and
PromptTemplate. The ungated remainder includes the most sensitive surfaces:
sessions, memory, companion secrets, model-provider config, and A2A invoke.
Turning on CELAuthorizer without the coverage work leaves those as bypass
paths, which is why the "wired into every handler" line has to become a real
coverage matrix and a scope commitment.


Bottom line

The direction is right: CEL as the default behind the existing interface. Merge
it as provisional. Before it moves to accepted:

  1. Replace the coverage claim with the matrix above, commit to closing the
    sensitive gaps, and adopt a deny-by-default authz middleware (hybrid with
    per-handler Check()) so the gap cannot come back.
  2. Move M2M principals and policy combining out of "Open Questions" and into
    resolved design. For M2M, adopt verified workload identity (projected SA
    token or SPIFFE) bound into the principal, not the X-Agent-Name header.
  3. Add the observability and threat-model sections.

References

Related PRs and issues:

  • #1766: per-agent
    kagent.dev/allowed-groups annotation, GroupAuthorizer, agent-list
    filtering, and A2A request gating (stalled on inactivity).
  • #1370: pluggable external
    authorizer (OPA-style webhook) behind the Authorizer interface (stalled).
  • #1293: OIDC proxy
    authentication (EP-476), which added the trusted-proxy mode this EP builds on.
  • #2071: STS and delegated
    identity, relevant to propagating caller claims on the agent path.
  • #2028: network gating
    (HTTPRoute, NetworkPolicy, OpenShift Route), the out-of-scope edge controls.

Source locations (verified against main):

  • NoopAuthorizer.Check returns nil: go/core/internal/httpserver/auth/authz.go.
  • Authorizer interface, the Verb set (get/create/update/delete),
    Resource{Name, Type} with no Namespace field, and Principal.Claims:
    go/core/pkg/auth/auth.go (Verb at lines 9-16, Resource at 18-21,
    Principal at 32-36).
  • Central Check helper and the HTTP-method-to-verb switch:
    go/core/internal/httpserver/handlers/helpers.go:56-80.
  • Middleware chain, where an AuthzMiddleware would sit next to
    AuthnMiddleware: go/core/internal/httpserver/server.go:356-360.
  • A2A registered as a PathPrefix handler with authentication only, no
    authorizer: go/core/internal/httpserver/server.go:347.
  • cel-go already in the module graph (indirect today, promote to direct when
    implementing): go/go.mod.
  • Agent status conditions, where AccessPolicyValid would be reported:
    go/api/v1alpha2/agent_types.go.
